Artificial Intelligence in Omics

Department of Physics, School of Science, Tianjin University, Tianjin 300072, China
Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300072, China
Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN 46202, USA
IUPUI Fairbanks School of Public Health, Indianapolis, IN 46202, USA
Regenstrief Institute, Indianapolis, IN 46202, USA
Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA

care; and 4) AI-based approaches for protein structure prediction, gene function prediction, and drug discovery.
With enthusiastic responses to our call for submissions, we are pleased to announce that 15 articles have been selected for publication in this special issue, including four review articles, two original research articles, and nine method articles. A list of original studies and tools reported in this special issue is provided in Table 1.
Among the four review articles, Brendel et al. provided an overview of the application of deep learning (DL) models to single-cell RNA sequencing (scRNA-seq) data analysis, and discussed the current challenges and future opportunities in this field [1]. Stanojevic et al. presented an in-depth review on the computational methods for integrating multi-omics data from the same single cells or aligning multi-modal data from different cells, providing a detailed technical summary of currently available methods [2]. Li et al. surveyed machine learning (ML) approaches in lung cancer research, highlighting the challenges and opportunities for integrating complex biomedical data to improve lung cancer diagnosis and therapy [3]. Zha et al. reviewed current methods for microbiome data mining and knowledge discovery, with a focus on AI methods for elucidating microbial communities and their spatiotemporal dynamic patterns [4].
In an original research article, Zhao et al. studied the application of explainable ML models to transcriptomics data. They performed a comprehensive evaluation of multiple explainers and proposed optimization strategies to improve model reproducibility and interpretability. This work provides new insights and guidelines on the use of explainable ML models for exploring novel biological mechanisms [5].
Most of the articles in this special issue are method articles, reporting AI-based tools for various omics applications. Lee et al. introduced SOPHIE (Specific cOntext Pattern Highlighting In Expression data), which uses a generative neural network to separate common and context-specific transcriptional patterns. SOPHIE can distinguish common differentially expressed genes (DEGs) that are frequently altered across different biological contexts, and context-specific DEGs that are relevant for particular experimental conditions [6].
Zhang et al. reported DGMP (Directed Graph convolutional network and Multilayer Perceptron), a novel ML-based method for identifying cancer driver genes from multi-omics pan-cancer data. DGMP combines a directed graph convolutional network, which makes use of diverse gene features and regulatory information in the multi-omics data, with a multilayer perceptron, which weighs preferentially on gene features. DGMP outperforms multiple state-of-the-art methods and identifies non-mutated cancer driver genes harboring epigenetic or expression alterations [7].
Wan et al. presented scEMAIL (Expert enseMble novel cell-type perception and local Affinity constraInts of muLtiorder for scRNA-seq data). scEMAIL is a universal and source-free transfer learning-based annotation framework for scRNA-seq data. It can automatically identify novel cell types without using source data [8].
Zhou et al. reported DeeReCT-TSS (Deep Regulatory Code and Tools-Transcription Start Site), a DL-based method for genome-wide prediction of transcription start sites (TSSs). DeeReCT-TSS incorporates both DNA sequence data and conventional RNA-seq data as inputs, and substantially outperforms existing methods for TSS prediction based on DNA sequence data alone [9].
Shan et al. presented TIST (transcriptome and histopathological image integrative analysis for spatial transcriptomics), a novel analytical tool for spatial transcriptomics (ST) data. By integrating matched ST data and histopathological images, TIST identifies spatial clusters and enhances spatial gene expression patterns. TIST outperforms multiple state-of-the-art methods, as benchmarked on both simulated and real datasets [10].
Yang et al. developed DeepNoise, a semi-supervised DL-based model to distinguish true biological signals from experimental noise. The authors used DeepNoise to identify and classify the phenotypic effects of 1108 genetic perturbations based on 125,510 fluorescent microscopy images, achieving high performance among competing methods [11].
In another original research article, Zhang et al. developed MAPD (model-free analysis of protein degradability), an ML method to predict protein degradability from protein-intrinsic features. MAPD achieves high accuracy in predicting kinases that may be subject to targeted protein degradation, and may generalize to non-kinase proteins. The authors also identified important features predictive of protein degradability [12].
Xu and Zhao curated a large benchmark dataset of linear B-cell epitopes (BCEs), which play a critical role in immune responses. Based on this dataset, the authors developed NetBCE, a ten-layer interpretable deep neural network to predict linear BCEs. NetBCE substantially outperforms conventional ML methods, and reveals distinct features of BCEs [13].
Zhu et al. presented TripletGO, a novel hierarchical method to predict gene functions, specifically Gene Ontology (GO) terms, by combining transcript expression profiles and protein homology inferences. TripletGO substantially improves the accuracy of gene function prediction compared to current state-of-the-art methods, in large part owing to a novel triplet network method that can effectively boost function prediction using transcript expression profiles [14].
Last but not least, Wei et al. reported DrSim (similarity learning for drug discovery). As a learning-based framework, DrSim automatically infers similarity between transcriptional profiles. DrSim outperforms existing methods based on in vitro and in vivo datasets related to drug annotation and repositioning. DrSim may be useful for phenotypic drug discovery based on high-throughput transcriptional perturbation data [15].
Overall, the 15 articles in this special issue showcase the broad applications and powerful utilities of AI in omics. We anticipate that new breakthroughs in omics-driven biomedical research will be made by harnessing the enormous power of AI. GPB will continue to provide a platform for AI-based tools and discoveries in omics.